The Churn Prediction for SaaS Platform project delivers an end-to-end machine learning solution that forecasts customer churn so the retention team can act before customers leave. It ingests customer data from MySQL, engineers features from metrics such as tenure, usage frequency, and support tickets, trains RandomForest and XGBoost ensemble models, uses SHAP for explainability, and exposes the results through an interactive Streamlit dashboard. The system achieves 87%+ accuracy (0.89 AUC-ROC with XGBoost), reduces churn by an estimated 25%, and keeps predictions interpretable. The project ran for about 7.5 months, from April 2025 to November 2025.
The architecture follows a streamlined pipeline: data is extracted from MySQL via ETL processes and preprocessed with feature engineering and class balancing; ensemble ML models (RandomForest and XGBoost) are trained on the result; predictions are explained with SHAP values; and everything is visualized in a Streamlit dashboard for interactive analytics. The design aims for efficiency, scalability, and integration with existing infrastructure, and it centers on three things: a clear churn definition (e.g., 30-day inactivity), model predictions, and actionable insights for retention teams.
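As an illustration of the churn definition, the 30-day inactivity rule can be sketched as a small labeling function. This is a minimal sketch, not the project's actual code: the function name `is_churned`, the snapshot date, and the choice to treat exactly 30 days of inactivity as still active are all assumptions made for the example.

```python
from datetime import date, timedelta

# Assumption: a customer counts as churned when their last recorded
# activity is MORE than 30 days before the snapshot date.
INACTIVITY_WINDOW = timedelta(days=30)


def is_churned(last_active: date, snapshot: date,
               window: timedelta = INACTIVITY_WINDOW) -> bool:
    """Return True when the inactivity gap exceeds the window."""
    return (snapshot - last_active) > window


# Hypothetical customers, labeled as of an illustrative snapshot date.
snapshot = date(2025, 6, 30)
labels = {
    "cust_a": is_churned(date(2025, 6, 20), snapshot),  # 10 days inactive
    "cust_b": is_churned(date(2025, 5, 1), snapshot),   # 60 days inactive
}
```

Fixing the snapshot date explicitly keeps labels reproducible when the same data is re-labeled later.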
The system uses Python for data processing and development, Scikit-Learn for RandomForest modeling and evaluation metrics, XGBoost for gradient boosting, LIME/SHAP for explainability, and MySQL for relational data storage and querying. Additional tools include Pandas for data manipulation, Streamlit for the interactive dashboard, and Matplotlib for SHAP visualizations.
The churn model combines two ensemble methods, RandomForest for robustness and XGBoost for capturing complex feature interactions, trained on stratified train/validation/test splits (70/15/15). Features include tenure (days since signup), usage frequency (logins per period), and support tickets (counts of open and resolved tickets), plus additional metrics such as amount deviation and one-hot encoded categorical variables. SMOTE handles class imbalance. SHAP provides both global and local explanations, which highlight tenure and support tickets as the key churn drivers.
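The stratified 70/15/15 split described above can be sketched with two calls to Scikit-Learn's `train_test_split`. This is a minimal sketch under stated assumptions: the data is synthetic (`make_classification` stands in for the engineered feature matrix), the hyperparameters are illustrative, and only RandomForest is fitted to keep the dependencies short. XGBoost and SMOTE are omitted here; if SMOTE is added, it would normally be applied to the training portion only so that validation and test sets keep the real class balance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix (roughly 20% churners).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

# First cut: 70% train, 30% held out, preserving the churn rate.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
# Second cut: split the held-out 30% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=42
)

# Illustrative hyperparameters, not the project's tuned values.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# AUC-ROC on the validation split, using the predicted churn probability.
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
```

The test split is deliberately left untouched here, so it can serve as a final check after tuning against the validation split.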
Data processing extracts records from MySQL using SQL queries and Pandas, engineers features (e.g., tenure calculation, frequency aggregation), applies scaling and encoding, and handles class imbalance via SMOTE. Models are trained with hyperparameter tuning and early stopping, predictions are written back to MySQL, and explanations are generated with SHAP. Throughout, the pipeline enforces data quality checks, anonymizes customer data for privacy, and keeps queries efficient for dashboard integration.
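The feature engineering step (tenure calculation, frequency aggregation, ticket counts) might look like the following in Pandas. This is a minimal sketch: the table layouts, column names (`signup_date`, `login_at`, `status`), the 30-day login window, and the snapshot date are hypothetical, and small in-memory frames stand in for the results of the MySQL queries.

```python
import pandas as pd

snapshot = pd.Timestamp("2025-06-30")  # illustrative snapshot date

# Hypothetical stand-ins for query results from MySQL.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "signup_date": pd.to_datetime(["2025-01-01", "2025-03-01"]),
})
logins = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "login_at": pd.to_datetime(
        ["2025-06-05", "2025-06-12", "2025-06-28", "2025-04-01"]
    ),
})
tickets = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "status": ["resolved", "open", "resolved"],
})

features = customers.set_index("customer_id")

# Tenure: days between signup and the snapshot.
features["tenure_days"] = (snapshot - features["signup_date"]).dt.days

# Usage frequency: logins in the 30 days before the snapshot.
window_start = snapshot - pd.Timedelta(days=30)
recent = logins[logins["login_at"] > window_start]
features["logins_30d"] = (
    recent.groupby("customer_id").size()
    .reindex(features.index, fill_value=0)
)

# Support tickets: one count column per status; zero when none exist.
ticket_counts = (
    pd.crosstab(tickets["customer_id"], tickets["status"])
    .reindex(features.index, fill_value=0)
)
features["tickets_open"] = ticket_counts.get("open", 0)
features["tickets_resolved"] = ticket_counts.get("resolved", 0)
```

Reindexing against the customer index ensures that customers with no recent logins or no tickets get a zero rather than a missing value.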
Testing includes unit validation of features and models, integration checks of the pipeline flow, performance tuning toward an AUC-ROC above 0.85, and usability testing of the dashboard (response times under 5 seconds). Deployment integrates the models and dashboard with MySQL through a phased rollout, with anonymization for privacy, bias checks via balanced sampling, and a rollback path that reverts to baseline models if needed.
After go-live, the team monitors model accuracy and drift (with periodic retraining on new data), dashboard usage logs, and MySQL query performance, targeting more than 99% uptime and response times under 5 seconds. Maintenance includes quarterly updates to the SHAP explanations, monthly bias audits, and cost controls, along with alerts on low-engagement patterns that trigger retention interventions.
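The source does not say how drift is measured. One common option is the Population Stability Index (PSI), which compares the distribution of a feature or model score at training time against recent data. The sketch below is an assumption, not the project's documented method: the function name, the 10-bin quantile scheme, and the synthetic data are illustrative.

```python
import numpy as np


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample and a recent sample of one variable."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from reference quantiles, so each bin holds
    # roughly equal reference mass.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    # Use interior edges only: values outside the reference range
    # fall into the outermost bins.
    inner = edges[1:-1]
    exp_counts = np.bincount(
        np.searchsorted(inner, expected, side="right"), minlength=bins
    )
    act_counts = np.bincount(
        np.searchsorted(inner, actual, side="right"), minlength=bins
    )
    # Clip to avoid log(0) for empty bins.
    exp_frac = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_frac = np.clip(act_counts / act_counts.sum(), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


# Synthetic example: one stable sample and one shifted sample.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(1.0, 1.0, 5000)

psi_same = population_stability_index(baseline, same)
psi_shifted = population_stability_index(baseline, shifted)
```

A frequently cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift worth a retraining review, though suitable thresholds depend on the feature and the business context.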